Introduction

The first official case of COVID-19 in the USA was confirmed on the 21st of January 2020. About three months later, almost 1 million cases had been discovered. In the context of this pandemic, it is of utmost importance to understand how the disease spreads by reporting data in a clear and insightful way.

This project has two main goals:

  • visualize different COVID-related metrics at the county level, using maps and time series: infection rate per 100,000 individuals, total cases, and total deaths

  • identify counties with potential errors in official counts: some counties show decreases in their cumulative number of cases from one day to the next, which should not be possible.

Notes about the report:

  • The R code used to generate this report is provided: click on the “Code” button on the right to display the code used in a given section.

  • Most graphs and tables are interactive: you can zoom in and out, click on elements to display more content, or search for specific data points.

Data preparation

US states and counties map based on NYT COVID data

We want to represent our COVID data on a map of the USA. To do so, we generate maps using the Leaflet framework. The idea is to add polygons representing counties to the basemap. The color of each polygon depends on the metric of interest (here, total cases or rate per 100,000 individuals). Clicking on a county gives more information about this area. Finally, we can represent the rate of cases per 100,000 individuals, this time using the NYT data, after first keeping only the most recent timepoint available in the dataset.
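The rate computation described above can be sketched as follows. This is a minimal illustration in base R, not the code behind the report: the `date`, `fips`, and `cases` columns follow the layout of the NYT county file, while the `pop` lookup table and all values are made up for the example.

```r
# Toy NYT-style county data (values are made up for illustration).
nyt <- data.frame(
  date  = as.Date(c("2020-04-19", "2020-04-20", "2020-04-19", "2020-04-20")),
  fips  = c("37063", "37063", "37183", "37183"),
  cases = c(300, 330, 600, 630),
  stringsAsFactors = FALSE
)

# Hypothetical population lookup; the report's actual population source
# is not shown, so this table is an assumption.
pop <- data.frame(
  fips       = c("37063", "37183"),
  population = c(300000, 1000000),
  stringsAsFactors = FALSE
)

# Keep only the most recent timepoint available in the dataset.
latest <- nyt[nyt$date == max(nyt$date), ]

# Join on the FIPS code and compute cases per 100,000 individuals.
latest <- merge(latest, pop, by = "fips")
latest$rate_100k <- latest$cases / latest$population * 1e5

latest
```

A column such as `rate_100k` could then drive the polygon colors, for instance through a palette built with `leaflet::colorNumeric()` and passed to `addPolygons()`.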

Evolution of COVID cases over time (New York Times data)

We can now have a look at COVID data over time. Let’s represent the evolution of the total number of cases in the most affected counties. You can select specific counties by (double) clicking on the legend on the right.
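One way to pick the "most affected" counties is to rank them by their latest cumulative count and keep the time series of the top few. The sketch below assumes that definition; the cutoff `n_top` and the case values are hypothetical and not taken from the report.

```r
# Toy cumulative counts for three counties over two days (made-up values).
nyt <- data.frame(
  date   = as.Date(rep(c("2020-04-19", "2020-04-20"), times = 3)),
  county = rep(c("Durham", "Wake", "Orange"), each = 2),
  fips   = rep(c("37063", "37183", "37135"), each = 2),
  cases  = c(300, 330, 600, 630, 150, 160),
  stringsAsFactors = FALSE
)

n_top <- 2  # hypothetical number of counties to display

# Rank counties by cumulative cases at the most recent date.
latest   <- nyt[nyt$date == max(nyt$date), ]
top_fips <- head(latest$fips[order(latest$cases, decreasing = TRUE)], n_top)

# Keep the full time series of the selected counties only.
top_series <- nyt[nyt$fips %in% top_fips, ]
top_series
```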

Evolution of cases in each state

Evolution of total cases in each county

For example, in North Carolina:

North Carolina

Evolution of relative cases in each county

For example, in North Carolina:

Identifying counties with unexpected patterns

Some data might be erroneous: the cumulative number of cases sometimes goes down, which is not supposed to happen. We need to identify the reasons underlying these observations. To do so, we first automatically identify counties reporting a negative difference in the cumulative number of cases from one day to the next.

List of counties with negative differences

The idea now is to compute, for each county, the difference in cumulative cases from one day to the next in order to identify negative differences (the cumulative number of cases going down). We can then report the counties presenting such negative differences in a table.
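The day-to-day difference can be sketched in base R as shown below. This is an illustrative version under assumed column names (`date`, `county`, `fips`, `cases`) and made-up data, not necessarily how the report computes it.

```r
# Toy cumulative counts: county "A" drops from 12 to 11 on the last day.
nyt <- data.frame(
  date   = as.Date(rep(c("2020-04-18", "2020-04-19", "2020-04-20"), times = 2)),
  county = rep(c("A", "B"), each = 3),
  fips   = rep(c("00001", "00002"), each = 3),
  cases  = c(10, 12, 11, 50, 55, 60),
  stringsAsFactors = FALSE
)

# Sort by county and date so that diff() compares consecutive days.
nyt <- nyt[order(nyt$fips, nyt$date), ]

# Difference within each county; the first day has no previous value (NA).
nyt$daily_diff <- ave(nyt$cases, nyt$fips,
                      FUN = function(x) c(NA, diff(x)))

# Rows where the cumulative count went down.
negative <- nyt[!is.na(nyt$daily_diff) & nyt$daily_diff < 0, ]
negative
```

Grouping by FIPS code rather than by county name avoids mixing counties that share a name across states.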

Time series for counties with largest discrepancies

We can have a look at the time series of some counties presenting a large negative difference in the number of cases (a drop of more than 10 cases from one day to the next).

##  [1] "Cullman 01043"     "Granville 37077"   "Onondaga 36067"   
##  [4] "Tazewell 17179"    "Dougherty 13095"   "Carson City 32510"
##  [7] "Ripley 18137"      "Oakland 26125"     "Madison 01089"    
## [10] "Lafayette 22055"   "Rensselaer 36083"  "St. Charles 22089"
## [13] "Tuscaloosa 01125"  "Lexington 45063"   "St. Landry 22097"
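The labels printed above pair a county name with its FIPS code. A list of this shape could be built as sketched below, assuming a table of daily differences with hypothetical `county`, `fips`, and `daily_diff` columns; the threshold of 10 comes from the text, everything else is illustrative.

```r
# Toy daily differences (made-up values).
diffs <- data.frame(
  county     = c("A", "A", "B", "C"),
  fips       = c("00001", "00001", "00002", "00003"),
  daily_diff = c(-3, -15, -12, 4),
  stringsAsFactors = FALSE
)

threshold <- -10  # keep drops of more than 10 cases

# Counties with at least one drop beyond the threshold.
large <- diffs[!is.na(diffs$daily_diff) & diffs$daily_diff < threshold, ]

# One "name FIPS" label per county, without duplicates.
labels <- unique(paste(large$county, large$fips))
labels
```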